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Abstract. Future galaxy surveys require one percent precision in the theoretical knowledge 
of the power spectrum over a large range including very nonlinear scales. While this level of 
accuracy is easily obtained in the linear regime with perturbation theory, it represents a serious 
challenge for small scales where numerical simulations are required. In this paper we quantify 
the precision of present-day Wbody methods, identifying main potential error sources from 
the set-up of initial conditions to the measurement of the final power spectrum. We directly 
compare three widely used Wbody codes, Ramses, Pkdgrav3, and Gadgets which represent 
three main discretisation techniques: the particle-mesh method, the tree method, and a 
hybrid combination of the two. For standard run parameters, the codes agree to within one 
percent at k < 1 /iMpc“^ and to within three percent at A: < 10 /iMpc“^. We also consider 
the bispectrum and show that the reduced bispectra agree at the sub-percent level for k <2 
h Mpc“^. In a second step, we quantify potential errors due to initial conditions, box size, and 
resolution using an extended suite of simulations performed with our fastest code PkdgravS. 
We demonstrate that the simulation box size should not be smaller than L = 0.5 /i“^Gpc 
to avoid systematic finite-volume effects (while much larger boxes are required to beat down 
the statistical sample variance). Furthermore, a maximum particle mass of Mp = 10® 
is required to conservatively obtain one percent precision of the matter power spectrum. As 
a consequence, numerical simulations covering large survey volumes of upcoming missions 
such as DES, LSST, and Euclid will need more than a trillion particles to reproduce clustering 
properties at the targeted accuracy. 
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1 Introduction 

In the last decades, cosmology has entered the high precision regime owing to ever more accu¬ 
rate measurements of the Cosmic Microwave Background (CMB). The statistical information 
of the CMB sky is, however, intrinsically limited, while large scale structures contain a great 
wealth of modes which can be exploited, provided nonlinear structure formation is well under¬ 
stood. Next-generation galaxy and weak lensing surveys such as DES^, LSST^, and Euclid^ 
require percent accurate modelling of the matter power spectrum up to wave numbers of 
A: ~ 10 /iMpc“^ in order to fully exploit their constraining power for cosmology [1, 2]. 

Standard perturbation theory gives accurate results up to A: ~ 0.1 /i Mpc“^, while numer¬ 
ical simulations are indispensable at higher wave numbers [3, 4]. Pure dark matter simulations 
based on V-body techniques are believed to be accurate up to about k ~ 0.5 h Mpc“^, beyond 
which baryonic feedback from active galactic nuclei (AGN) must be included [5, 6]. 

In this paper we focus on the matter power spectrum from collisionless A-body simu¬ 
lations, ignoring all hydrodynamical effects. Although strictly not valid at small scales, this 
approach is currently the only option for precision cosmology as the relevant AGN feedback 
mechanism is not well understood, and is poorly constrained by observations. A potential 
way forward is to study AGN feedback with high resolution simulations of small cosmological 
volumes and to parametrise the effects on the matter power spectrum. Gosmological param¬ 
eter estimation can then be carried out on the basis of A-body simulations plus additional 
free model parameters accounting for the AGN contribution [6, 7]. 

^www.darkenergysurvey.org 
^www.Isst.org/lsst 
^sci.esa.int/euclid 
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Comparison studies of A/^-body codes and subsequent analysis tools have been performed 
in the past. The first investigations of different A^-body techniques was carried out in Ref. [8] 
more than thirty years ago. More recently, the authors of Ref. [9] compared high-redshift 
power spectra and halo abundances from mesh- and particle-based techniques, reporting sig¬ 
nificative differences at small scales. The first detailed code comparison study including six 
gravity codes was carried out by Ref. [10]. In terms of the power spectrum, the authors 
reported agreement of roughly ten percent between particle codes up to A: ~ 10, while mesh 
codes deviated already at smaller wave-numbers due to incomplete resolution. Three years 
later, another comparison project with 10 gravity codes was carried out by Ref. [11], show¬ 
ing further improvement in code agreement and stating roughly one percent differences for 
the power spectrum below /c ~ 1 h/Mpc. At larger wave-numbers they observed growing 
discrepancy between mesh and particle codes which exceeded ten percent at A: ~ 10 h/Mpc, 
while both methods were shown to agree to about five percent among each other. Finally, 
Ref. [12] reinvestigated the difference between mesh and particle methods using simulations 
with larger, cosmological viable box sizes and particle numbers. They confirmed the reported 
percent agreement up to A: = 2 h/Mpc. 

In the present paper we build upon these past efforts and compare three gravity codes 
representing the most widely used discretisation techniques. We thereby use an unprecedented 
setup in terms of box size and particle resolution to allow for a code comparison free of 
systematic effects. This is confirmed in the second part of the paper were we show that 
potential systematics from initial conditions, box size, and particle numbers are below the 
percent error condition. 

Sec. 2 is devoted to the code comparison, focusing on the auto and cross power spectra. 
In Sec. 3 we take a critical look at the simulation pipeline and investigate the accuracy of 
the initial conditions as well as potential finite volume and resolution effects. A summary of 
the results including a list of requirements to obtain percent accuracy of the matter power 
spectrum is presented in Sec. 4. In the Appendices, we investigate modifications of the code 
parameters (Appendix A) and we present a code comparison beyond the power spectrum 
(Appendix B). 

2 Code comparison 

The first part of this paper is about comparing A^-body codes with respect to the precision 
requirements of upcoming galaxy and weak lensing surveys. Our study is mainly focused on 
the auto power spectrum which is the prime statistical measure in cosmology. Additionally, 
we investigate phase-shifts in Fourier space by cross-correlating the results of different codes 
in order to further quantify the spatial disagreement between density fields from different 
A^-body techniques. 

2.1 A^-body codes 

The gravitational A^-body technique is the standard tool to simulate the nonlinear Universe, 
yielding accurate results at scales where hydrodynamical effects are subdominant. Most 
A^-body codes are either based on a particle-mesh method, a tree algorithm, or a hybrid 
combination of the two. In this paper, we compare the codes Rainses, PkdgravS, and Gadgets, 
which represent each of these three approaches and are widely used in the astrophysics and 
cosmology community. 
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The comparison is performed by running a simulation of box size L = 500 /i“^Mpc 
and resolution of = 2048 particles per dimension with each of the three codes, starting 
from the exact same initial conditions and using the standard run parameters described 
below. The initial conditions are based on second order Lagrangian perturbation theory 
(2LPT), generated at redshift 49 with a modified version of the IC code from [13, 14], For the 
cosmological parameters, we use Planck values, i.e., 12^ = 0.3071, JIa = 0.6929, life = 0.0483, 
h = 0.6777, Us = 0.9611, and erg = 0.8288 [15]. The measurement of the power spectra is 
performed at exactly the same redshifts and with the same analysis tool (using the triangular 
shaped cloud scheme for the mass assignment). In this way, we carefully avoid all other 
potential sources of error and directly compare effects due to the gravity calculations of the 
codes. 

We now briefly present the three codes and give details about the run parameters for 
the comparison: 

• The Wbody and hydrodynamical code Ramses [16] is based on a particle-mesh technique 
and uses adaptive mesh refinement for high density regions. The code is mainly used 
for hydrodynamical simulations in a cosmological context [17, 18] but it has also been 
employed for pure dark matter Wbody runs in the past [19, 20]. For the comparison, 
we apply a coarse-level grid with refinement level t'min = 12, corresponding to 2048^ 
coarse cells. New refinements are triggered on a cell-by-cell, recursive basis when a 
cell collects more than 8 particles. Using this strategy we reach a maximum level of 
refinement £max = 18, corresponding to a spatial resolution of 2 /i“^kpc. We employ 
adaptive, level-by-level time-stepping, with a time step size set smaller than the local 
free fall time, and by the requirement that a particle cannot move more than half a cell 
within one time step. The convergence criterion for the Poisson solver, defined as the 
ratio of the residual L^-norm to the right-hand side L^-norm, is set to e = 10“^. 

• The gravity code PkdgravS [an earlier version of which is described in 21] is based on 
a binary tree algorithm using fifth order fast multipole expansion of the gravitational 
potential (using cell-cell interactions making it an 0{N) gravity calculation method). 
Periodic boundaries conditions are calculated with the Ewald summation technique, 
requiring very little data movement while exposing a high degree of parallelism. Pkdgrav 
has been extensively used for A^-body simulations in the past, mainly in the context 
of cosmological zoom simulations such as Via Lactea [22] and Ghalo [23]. The current 
version of Pkdgrav includes GPU acceleration for all force calculations, leading to a 
significant speed-up with respect to previous versions. In this paper, we use the run 
parameters e = 0.02 /mean (where /mean is the mean particle separation) and 6 = 0.7 
{9 = 0.55 above redshift two) for softening and tree opening criteria. The adaptive 
time-stepping is parametrised in the standard way, i.e dti = r/y^e/|aj| with i] = 0.15 
(where ai is the acceleration of particle /). PkdgravS also has a more sophisticated 
time-stepping criterion based on an estimation of the local dynamical time. 

• The tree-particle-mesh code Gadgets applies a uniform particle-mesh method at large 
scales plus a first order oct-tree technique at small scales [see 24, for a description of an 
earlier version of the code]. Gadget is extensively used in many contexts and is most 
known for the Millennium suite of cosmological simulations [25], as well as the Aquarius 
zoom simulations [26]. For the comparison, we use a comoving Plummer-equivalent 
softening length of e = 10 /i“^kpc and the code’s relative tree opening criterion with 
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a tolerance value of a = 0.005 for the gravitational force accuracy [see 24, for more 
information]. Furthermore, we adopt a time integration parameter corresponding to rj = 
0.22 for the time-stepping criterion used above in PkdgravS. The long-range particle- 
mesh forces are calculated with a 2048^ Fourier grid. 

All simulations used of this comparison were performed on the hybrid CPU/GPU cluster 
Piz Daint at the Swiss National Supercomputing Centre (CSCS). The total run-time for the 
three codes is 94 352 node-hours for Ramses, 34 524 node-hours for Gadgets, and 1632 node¬ 
hours for PkdgravS (the former two codes were run on 512 nodes using CPU only, while the 
latter was run on 128 nodes using full GPU acceleration for the force calculations). On each 
node of Piz Daint there are 8 CPU-cores and one Nvidia Tesla K20X GPU accelerator. 


2.2 Definitions 

Before discussing differences between the gravity codes, we give a brief definition of the power 
spectrum and cross power coefficient (definitions of the propagator and the bispectrum can 
be found in Appendix B). Let us first assume we have two density fields (5j)c(k) and (5y(k) in 
Fourier space. The power spectrum Pxvik) is then defined as (see e.g. Ref. [27]) 

(,5x(k)<5y(k')) = <5fl(k + k')PxY{k), (2.1) 


where 5 d(x) is the three dimensional Dirac delta function. Eq. (2.1) defines both the auto 
and the cross power spectrum, which we now briefly discuss. The auto power spectrum is 
given by 

P{k) = Pxx{k) (2.2) 

and provides a measure of the density amplitudes at different A:-modes. The cross power 
spectrum PxY (with X ^Y), on the other hand, also measures the phase differences between 
density fields. It is convenient to the define the cross power coefficient (e.g. [28-30]) 


rxY{k) = 


PxYjk) 

^Pxx{k)PYY{ky 


(2.3) 


which only contains information about phase shifts while all amplitudes are factored out. This 
becomes evident if we split the perturbation field into an amplitude and a phase component, 
i.e. (5_x(k) = Ax(k) exp [i(px{k)] (see e.g. Ref. [31]). The fact that the cross power coefficient 
measures the spatial shifts between density fields makes it an interesting alternative indicator 
for a code comparison. 


2.3 Auto power spectrum 

Analysing the accuracy of the auto power spectrum P{k) from numerical simulations is the 
main goal of this paper. The measurement of P{k) is performed with fully mpi-parallel Fast 
Fourier Transform (FFT) using the triangular shaped cloud (TSG) method to assign particles 
on the grid^. 

The resulting power spectra are shown in Fig. 1, where different panels correspond to 
different redshifts z = 3.8, 2, 1.0, and 0.0. The green lines refer to PkdgravS, the red lines to 

^In order to avoid smearing effects, we normalise the density contrast in fc-space with the Fourier transform 
of the assignment window (see e.g. [32-34] for more information). 
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Figure 1. Comparison of auto power spectra from the three different fV-body codes at different red- 
shifts. Green lines correspond to PkdgravS, red lines to Gadgets, and blue lines to Raunses (reference 
lines). One percent agreement (indicated by the grey band) is obtained for fc < 1 /iMpc“^ (dashed 
vertical line). 


Gadgets, and the blue lines to Ramses^. One percent agreement (grey shaded area) between 
the different codes is obtained up to A; ~ 1 /iMpc“^ over all redshifts, as illustrated by the 
vertical dashed line. In the highly nonlinear regime from A: = 1 to 10 /iMpc“^, the agreement 
between codes is at the three percent level for z = 1 and below®. At higher redshifts, the 
discrepancy grows, reaching about five percent at z = 2 and ten percent at z = 3.8. 

The agreement between the codes is significantly better than in previous code comparison 
projects by Heitmann et al. [11, hereafter H08] and Heitmann et al. [12, HIO] illustrating the 
progress in code development over the last five years. At very large scales we obtain maximal 
differences of ~ 0.4 percent between codes, with respect to ~ 3 percent in H08 and ~ 2 

®We have chosen the blue lines to act as reference, solely because it lies between the green and red lines at 
2 = 0, therefore improving the readability of the plots. 

®While Ramses and PkdgravS show percent agreement until fc ~ 7 /iMpc“'^, Gadget slightly deviates at 
fc ~ 1 /iMpc“^. 
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percent in HIO'. At small scales beyond fc ~ 1 li/Mpc, the prominent systematic offset 
between PM-codes and tree-codes visible in H08 (with more than 10 percent difference at 
A: ~ 10 /iMpc“^) has now entirely disappeared. 

The relatively large difference between Pkdgrav3/Gadget3 and Ramses at z=3.8 can be 
explained by the fact that AMR codes reqnire many more particles to resolve haloes. As we 
will see in the following sections, the power spectrnm is dominated by gronp-sized haloes at 
redshift zero, which are well resolved in onr simnlations. At high redshift, however, the signal 
stems from considerably smaller strnctnres which are better resolved with the tree-codes than 
with an AMR technique. Higher resolntion of the AMR grid is reqnired to remedy this. 

The resnlts of the code comparison are based on the standard rnn parameters described 
above. These parameters have never been systematically tested in the context of large-scale 
cosmological simnlations, bnt they emerged via many different code applications in the past. 
However, finding more optimal code parameters is non-trivial becanse the trne power spectrnm 
is not known a priori . Parameters cannot simply be tnned to achieve maximal agreement 
between codes since this conld lead to convergence towards the wrong answer. 

It is nevertheless important to qnantify the dependency of the rnn parameters on the 
resnlting power spectrnm. Only resnlts which are insensitive to the choice of code parameters 
can be trnsted. In Appendix A we stndy the effects of the most sensitive code parameters, 
which are the size of the PM grid for Gadget3 as well as the time-stepping criterion for all 
three codes. We conclnde that reasonable variations of these parameters lead to snb-percent 
effects on the power spectrnm below A: ~ 10 /iMpc“^, smaller than the observed differences 
between codes visible in Fig. 1. 

Snmming np, the code comparison snggests that the consensns between different N- 
body techniqnes is good, however not qnite good enongh for the targeted percent accnracy 
np to /c ~ 10 hMpc“^. Fnrther improvements to the codes will not be easily achievable 
as the correct solntion for the matter power spectrnm is not known. A common effort of 
the commnnity is reqnired to converge towards a generally accepted solntion. In order to 
enconrage fnrther comparison of Ai-body codes, we release the IC file nsed here pins all power 
spectrnm measnrements on www. ics.uzh. ch/~aurel/. 

2.4 Cross power spectrum 

The cross power coefficient rxY (defined in Eq. 2.3) qnantifies the spatial shifts between two 
density fields and is therefore a sensitive statistical measnre to compare A^-body codes. While 
the anto power spectrnm only gives information abont the amplitnde of perturbations, the 
cross power coefficient measnres the relative phase-shifts for any given A:-mode. The cross 
power spectrnm is obtained by separately Fonrier-transforming the two density fields from 
different A^-body codes. 

In Fig. 2 we plot the cross power coefficients based on density fields from Gadget3- 
Pkdgrav3 (brown), Gadget3-Ramses (magenta), and Pkdgrav3-Ramses (cyan) at redshift 2 
(top) and redshift 0 (bottom). At the largest scales {k < 0.5 /iMpc“^), the resnlts are in 
perfect agreement, which can be explained by the fact that the cross power coefficient is 
independent of the growth factor and therefore insensitive to errors related to global time- 
integration. At smaller scales and especially at low redshift, the deviations are considerably 
larger than the differences observed in the power spectrnm. This is dne to the effect of gravity 
which magnifies deviations of phases over time, something that is clearly visible in Fig. 2. 

^The better agreement between PkdgravS and Gadgets is (at least partially) because of a new implemen¬ 
tation of the periodic boundary conditions in Pkdgrav. 


6 




_^J_I. .I 

10 '^ 10 ° 
k [h/Mpc] 


Figure 2. Cross power coefficient (as defined in Eq. 2.3) at redshift two (top) and redshift 0 
(bottom). Brown, magenta, and cyan lines correspond to the combinations of density fields from 
Gadget3-Pkdgrav3, Gadget3-Ramses, and Pkdgrav3-REimses. 


In general, the cross power coefficients from different code combinations are in good 
agreement with each other. At redshift 2, there are no visible phase-shifts up to /c ~ 2 
h/Mpc. At smaller scales some small differences start to appear, while the density fields 
from Gadgets and Ramses seem to agree somewhat better with each other than with the 
density field from PkdgravS. At redshift zero, phases-shifts start to be visible above k ~ 0.5 
/iMpc“^. The largest differences are observed between the density fields of Gadgets and 
PkdgravS, which is surprising given the fact that they use similar numerical techniques at 
small scales. 

In summary, we want to highlight the extremely good agreement of the cross power 
coefficients below k ~ 0.5 h/Mpc suggesting vanishing force errors at large scales. As a 
consequence, the sub-percent differences visible in power spectrum at large scales (see Fig. 1) 
have to come from slightly different growth factors and are therefore potentially stemming 
from small inaccuracies in the global time-integration schemes. Finally, we want to stress 
that both the auto and cross power spectra shown in this paper only provide measures for the 
relative differences between codes but do not indicate which one of the three codes is most 
accurate. 


3 Testing the iV-body pipeline 

Potential inaccuracies of numerical simulations are not restricted to the A^-body code but can 
stem from the initial conditions, limited box-size or physical resolution. Each of these sources 
of error has been extensively studied in the past (see e.g. [12, 19]) . Here, we reanalyse potential 
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Figure 3. Ratios of power spectra from simulations with ZA and 2LPT initial conditions and different 
starting redshifts (zi). The top panel shows measurements at redshift Zi, the bottom panel at redshift 
zero. One percent agreement is illustrated by the grey band. 


effects from initial conditions, box-size and resolution with the focus on the requirement of 
sub-percent errors. 

3.1 Initial Conditions 

Initial conditions of cosmological simulations are generated as a random realisation of a (Gaus¬ 
sian) density field, based on either first or second order Lagrangian perturbation theory. The 
density field is usually discretised in form of aligned particles on an initial grid, where small 
displacements account for the initial perturbations. 

The redshift of the initial conditions has to be chosen with care. It should lie in a range 
where all resolved perturbations are large enough to dominate numerical noise, but still small 
enough to be accurately described by perturbation theory. It has been shown in the past that 
it is advantageous to use second order Lagrangian perturbation theory (2LPT) with respect 
to the simpler first order or Zel’dovich approximation (ZA), as it allows for smaller starting 
redshifts, further away form the noise dominated high-redshift regime [13, 35-37]. 

We study the effects of the initial conditions on the power spectrum at redshift zero by 
running simulations with L = 512 h“^Mpc and N = 1024 particles per dimension with the 
Wbody code PkdgravS. The initial conditions are generated with MUSIC [38], using both the 
ZA and 2LPT approach at different starting redshifts (zj). 

The resulting effects on the power spectrum are illustrated in Fig. 3. In the top panel, 
we show the ratios between ZA and 2LPT directly measured in the ICs at the corresponding 
starting redshifts of Zi = 200 (red line), Zi = 100 (green line), and Zi = 49 (blue line). The 
differences between ZA and 2LPT are at sub-percent level (converging towards large Zj) and 
limited to high wave numbers above fe ~ 1 /iMpc“^. In the bottom panel, we show the 
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same ratios now measured at redshift zero. The differences between ZA and 2LPT have 
grown substantially affecting wave number beyond fe ~ 0.1 /iMpc“^. This behaviour is in 
agreement with previous findings [13, 39, 40]. 

Fig. 3 suggests that a starting redshift Zi > 200 is required to obtain percent accuracy 
with ZA initial conditions. Such high starting redshifts are prone to numerical problems, since 
A^-body codes do not deal well with extremely small initial density perturbations. At what 
redshift numerical effects become a problem depends on the code and the run parameters. 
Based on a study involving Pkdgrav2 and Gadget2, Reed et al. [36] concluded that the initial 
redshift and the redshift of typical halo formation should not differ by more than a factor of 
fifty. For the cosmological boxes investigated here, most haloes form around redshift two [see 
for example 41] which results in the requirement Zi < 100®. 

In agreement with previous results, we conclude that initial conditions with 2LPT should 
be used consistently for cosmological simulations. They are significantly more accurate than 
ZA initial conditions and they allow lower starting redshifts, thus decreasing the run-time of 
simulations. 


3.2 Box size and resolution 


A careful setup of simulations in terms of box size and particle numbers is crucial in order to 
obtain one percent agreement in the power spectrum. Small boxes tend to suffer from sample 
variance and missing large-scale modes, while large boxes might not have enough resolution 
to capture the very nonlinear scales. 

It is straight-forward to determine the expected statistical (Gaussian) error which con¬ 
sists of a sample variance and a shot-noise contribution and is given by 

( 3 . 1 ) 


where AA^m = {2'k^') is the number of modes per fc-bin and Pgn = {L/N)^ is the 

Poisson shot-noise. From Eq. (3.1) it becomes obvious that the sample variance increases 
towards larger scales. Enforcing sub-percent statistical errors and assuming Ak = 27r/L 
results in the condition 


^ k 


(3.2) 


for the box length L. This means that a minimal box length of L = 2.5 h~^Gpc is required to 
beat down the sample variance below the percent level for /c-modes above k ~ 0.1 /iMpc“^. 
The shot noise contribution Psn, on the other hand, becomes important at the smallest scales. 
In this paper we subtract Poisson shot-noise from all measured power spectra and we indicate 
the scale above which shot-noise contributes at more than one percent to the total power 
spectrum. 

Next to statistical errors there are systematical effects due to finite volume and resolution 
of the simulation setup as well as the nonlinear nature of gravity. These errors are more 
difficult to quantify and we will focus on providing estimates of how they can be minimised 
to sub-percent level. 

We use Pkdgrav3 to run a suite of numerical simulations with varying box size and parti¬ 
cle number. Volume effects on the power spectrum are investigated by comparing simulations 


®Recent tests with PkdgravS show that high-redshift errors can be reduced by choosing a smaller tree¬ 
opening parameter during the first gravity steps. This could potentially allow to shift the starting redshift to 
higher values for the same precision requirements. 


9 





Figure 4. Investigating box-size and resolution effects on the power spectrum. Left panel: Same 
physical resolution and increasing box size: L=128 h“^Mpc (blue), L=256 /i“^Mpc (green), L=512 
h“^Mpc (red), L=1024 /i“^Mpc (black, reference line). Right panel: Same box size (L=512 /i“^Mpc) 
and increasing number of particles (per dimension): N=256 (blue), N=512 (green), N=1024 (red), 
and N=2048 (black, reference line). The coloured areas quantify the statistical errors from sample 
variance, while the grey shaded band highlights the range of percent accuracy. Dotted vertical lines 
indicate where the shot-noise contribution exceeds one percent. 


with the same mass resolution and different box sizes. Effects due to particle numbers are 
studied with runs of constant box sizes. All simulations are based on 2LPT initial conditions, 
generated with MUSIC at redshift 49. The power spectrum is measured with the triangular 
shaped cloud (TSC) mass assignment on a 8192^ grid. 

In the left panel of Fig. 4, we illustrate ratios of power spectra from runs with the same 
mass resolution but different box sizes and particle numbers at redshift zero. The particle 
mass is kept constant at Mp ~ 10^*^ while the box length is increased together 

with the number of particles. A small box with L = 128 /i“^Mpc (blue line) systematically 
underestimates the power by more than 10 percent. Doubling the box size to L = 256 
/i“^Mpc box (green line) leads to an overall accuracy of 5 percent (one percent at small 
scales, k > 1 /iMpc“^). Boxes with length of L = 512 /i“^Mpc (red line) and more (L = 1024 
/i“^Mpc, black reference line) differ by about one percent or less over the entire range of 
wave numbers. We therefore conclude that the box size of simulations should not be smaller 
than 500 /i“^Mpc in order to eliminate all systematic nonlinear hnite-volume effects at the 
required percent precision (see also [42, 43] for similar conclusions). Reducing the sample 
variance below the percent level over the entire A:-range illustrated in Fig. 4 would require a 
much larger box of L 12 h-^Gpc. 

In the right panel of Fig. 4, we plot power spectra from runs with the same box size 
[L = 512 /i“^Mpc) and different particle numbers, effectively increasing the mass resolution. 
Simulations with N = 256 (blue line), N = 512 (green line), and N = 1024 (red line) 
underestimate the power on small scales with respect to the N = 2048 reference run (black 
line). The convergence rate with respect to the percent accuracy requirement is directly 
proportional to the scale where the simulation shot-noise becomes relevant, i.e. the wave 
number kgn at which Psn/P = 0.01. (illustrated by the dotted vertical lines in Fig. 4). The 
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maximum wave number fcmax warranting percent accuracy is well described by /Cmax = ^sn/3. 
For the runs shown in the right panel of Fig. 4 with N = 256, 512, 1024 (blue, green and red 
lines) this results in femax = 0.2, 1.0, 4.0 /iMpc“^. Previous investigations by Refs. [12, 44] 
have proposed femax to be half the Nyquist frequency {k^y = irN/L) instead. For the same 
runs this would lead to fcmax = 0.78, 1.58, 3.14 /iMpc“^ which does not exactly reproduce 
our results®. 

The drop in power of low resolution runs, visible in the right panel of Fig. 4, can be 
explained in terms of analytical considerations: experiments with the halo model show that 
clusters significantly contribute to the power spectrum, while the presence of small haloes 
below 10^^ have a negligible effect [45, 46]. Since the simulations with lower resolution 

(represented by the coloured dots) do not resolve haloes down to masses of 10^^ they 

underestimate the physical power at small scales. The N = 2048 simulation on the other 
hand, has a particle mass of Mp 10® h ^Mq, resolving 10^^ h ^Mq haloes with ~ 100 
particles. Moreover, the convergence rate in the plot suggests that the N = 2048 run is one 
percent accurate until A: ~ 10 ZiMpc”^ at redshift zero. 

Based on the right-hand-side panel of Fig. 4, we can determine a minimal mass resolu¬ 
tion required to obtain percent convergence in the matter power spectrum. Since the runs 
illustrated by the blue, green, and red lines underestimate the power by more than a percent 
for values above k = 0.25, 1, 4 /iMpc“^, we can safely expect the black line to depart from 
the true answer beyond k = 10 h Mpc“^. A conservative estimate therefore yields a maximum 
simulation particle mass of Mp = 10® guaranteeing a percent converged power spec¬ 

trum at all scales up to /c = 10 /iMpc“^. This requirement can be relaxed to Mp ~ 8 x 10^® 
for wave numbers up to A: = 1 hMpc~^ (as shown by the green line). Numerical 
simulations for upcoming survey missions need large boxes of at least 4 /i“^Gpc to cover the 
entire survey volume [1, 2]. This means that at least N = 16000 particles per dimension (i.e 
four trillion in total) are required to reach percent precision for the power spectrum up to 
A; ~ 10 /iMpc“^. 

3.3 Best guess for the power spectrum 

After investigating the convergence with respect to box size and mass resolution, we present 
a suite of four simulations with each N = 2048 per dimension and decreasing box sizes of 
L = 4096, 2048, 1024, and 512 /i“^Mpc. These simulations provide a combined measurement 
of the power spectrum over the entire range of scales from k ~ 0.05 /iMpc“^ to A; ~ 10 
/iMpc“^. 

In Fig. 5 we use the combined power spectrum from these four simulations as reference 
line, where the different colours indicate which simulation as been used for which k-range. 
The colour-shaded areas furthermore indicate the uncertainties due to sample variance and 
the grey-shaded area delimits the range of percent precision. In Fig. 5 the Franken emulator 
[11, solid black line], the halofit model [47, dashed black line], and the linear theory (dotted 
black line) are plotted against the reference line from our simulations. The halofit model 
is a revised version of the Smith et al. [48] fitting scheme, which is physically motivated by 
the halo model and claims to be 10 percent accurate for A: < 1 /iMpc“^ between z = 0 
and z = 10. Compared to our simulations the agreement is better than 5 percent over all 

^Assuming a power-law dependence of the power spectrum, P{k) oc k~°‘, the scaling of fcmax = fesn/S goes 
as fcmax oc . For the asymptotic limit of a = 3 both approaches - the convergence scale to be tied to 

the shot noise or to the Nyquist frequency - yield the same scaling with k. For a < 3, however, fesn converges 
somewhat faster, which is in better agreement with our simulations. 
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Figure 5. Power spectra of the cosmic Franken emulator [49], the revised halofit function [47], and 
the linear prediction compared to the outcome of simulations with N = 2048 particles per dimension 
and varying box size L=512 /i“^Mpc (magenta), L=1024 /i“^Mpc (red), L=2048 /i“^Mpc (green), 
L=4096 /i“^Mpc (blue). Coloured areas quantify the statistical errors from sample variance and 
shot-noise, the grey shaded band highlights the range of percent accuracy. 


measured scales and redshifts. The cosmic Franken emulator is an interpolation tool based on 
a suite of simulations with varying cosmological parameters [12, 49]. The agreement between 
the emulator and our simulations is about three percent at z = 0 and hve percent at z = 1. 
This is roughly within the stated accuracy of Heitmann et al. [49]. However, our simulations 
consistently predict more power than the cosmic emulator at scales around A: ~ 1 /iMpc“^ 
and above, conhrming results from Skillman et al. [50] who observe a similar departure from 
the cosmic emulator. Part of the difference should come from the fact that the emulator 
was calibrated with Gadget runs, while we use Pkdgrav, two codes that differ by about three 
percent at A: > 1 /iMpc“^ as illustrated in Fig. 1. 
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4 Conclusions 


The future of cosmology relies on data from large scale structure surveys. This data can 
only be fully exploited if we understand gravitational clustering and galaxy formation at high 
accuracy. The matter power spectrum, as the prime statistical measure, needs to be known 
within percent precision from linear scales up to /c ~ 10 /iMpc“^. 

Although cosmological A^-body techniques have been developed and constantly improved 
during the last two decades, obtaining the required accuracy remains a challenge. The entire 
pipeline from the generation of initial conditions to the analysis of the final data needs to be 
examined carefully and potential sources of error have to be quantified. 

In this paper, we compare power spectra of simulations from the three gravity codes 
Ramses, PkdgravS, and Gadgets. These codes are well established in the community and rep¬ 
resent common A^-body techniques for cosmological simulations: the particle-mesh technique, 
the tree method, and a hybrid combination of the two. In a second part, we explore potential 
error sources from initial conditions, simulation volume, and resolution, investigating effects 
on the matter power spectrum. These findings are then expressed in terms of a minimal 
volume and minimal mass resolution requirement to obtain the targeted percent accuracy. 

The main results of the paper can be summarised as follows: 

1. Gravity calculation: The gravity codes Ramses, PkdgravS, and Gadgets agree within 
one percent up to /c = 1 /iMpc“^ (over all studied redshifts), and within three percent 
up to /c = 10 /iMpc“^ (below redshift one). Increasing the accuracy of the global time 
integration is likely to further reduce errors at the largest scales (as suggested by our 
analysis of the phase spectrum and the bispectrum). Things are likely to be much more 
challenging at small scales, as there is no reference solution for the nonlinear power 
spectrum. 

2. Simulation volume: A box size larger than L ~ 0.5 /i“^Gpc is needed to avoid biases 
from nonlinear finite-volume effects at the percent level of the power spectrum. Reduc¬ 
ing the Gaussian sample variance to a sub-percent level for the /c-range above k = 0.1 
/iMpc“^ would require an even larger box size of L ~ 2.5 /i“^Gpc. 

3. Mass resolution: A conservative estimate of the maximum particle mass in simulations 

yields Mp = 10® for percent accurate power spectra up to /c = 10 /iMpc“^. 

This requirement can be relaxed to Mp ~ 8 x 10^® /i“^Mq, if only wave numbers up 
to k = 1 /iMpc“^ are considered. Upcoming surveys, such as DES, LSST, and Euclid, 
require large simulation volumes of L ~ 4 /i“^Gpc or more. As a consequence, numerical 
simulations need to have at least N ~ 16000 particles per dimension (i.e four trillion in 
total) to reproduce the power spectrum at targeted accuracy. 

4. Initial conditions: Initial conditions based on the Zel’dovich approximation (ZA) require 
very high starting redshifts of Zi = 200 or above. Such high redshifts are prone to nu¬ 
merical errors, since the size of perturbations are of the order of the numerical accuracy. 
Initial conditions based on second order Lagrangian perturbation theory (2LPT) are 
significantly more accurate. They allow late starting redshifts, reducing the run-time of 
simulations and minimising potential numerical errors in the high redshift regime. 

Summarising these results, it is possible to run cosmological simulations with sub-percent 
errors from volume and mass resolution effects, however, at the price of very high particle 



numbers. In terms of the gravity calculation, the agreement between codes is good, but not 
quite at the percent level for the very nonlinear regime. 

In the future, it will be crucial to include baryonic effects driven by AGN feedback, as 
they have been shown to significantly affect the matter power spectrum at nonlinear scales 
[5, 51]. Quantifying and parametrising the AGN feedback will be one of the main challenges 
of computational cosmology, and a basic requirement to take full advantage of the upcoming 
large scale structure observations. 

Data Release 

All relevant data of the code comparison project, i.e. the IG file, run parameters, and power 
spectra measurements, can be found at www.ics.uzh.ch/~aurel/. We hope that this in¬ 
formation will be useful for future comparison and accuracy tests including other Wbody 
codes. 
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A Variation of code parameters 

In the main text, we nse standard code parameters from the literatnre to compare the different 
gravity codes. This is jnstihed becanse the exact solntion of the power spectrnm is not known 
at nonlinear scales, and a posterior adjnstment of code parameters wonld lead to a false 
impression of convergence. It is nevertheless important to qnantify how the choice of code 
parameters affects the hnal resnlts. 

In general, code accnracy parameters can either be attribnted to the force calcnlation 
or the time-stepping. Typical parameters regnlating the force accnracy are softening-length 
and opening-angle for tree-codes (snch as PkdgravS) as well as grid rehnement strategy and 
accnracy of the Poisson solver for adaptive PM codes (snch as Rainses). Hybrid codes (snch 
as Gadgets) nsnally have an additional parameter regnlating the transition scale between 
the PM and tree regime (PM-grid). The accnracy of time integration, on the other hand, is 
nsnally controlled by the adaptive time-stepping criterion which is implemented in a similar 
way in all three codes. 

Past work has shown that for tree codes softening and tree-opening criteria show percent 
convergence at /c < 10 /iMpc“^for reasonable parameter choices [36, 52]. The same seems 
trne for the force accnracy parameters of adaptive PM codes which have shown to yield the 
same precision than generic tree-codes [16]. More signihcant deviations are reported for the 
transition parameter between the PM and tree regimes in hybrid codes [52] and for the time¬ 
stepping criterion affecting all three codes [36]. In the following, we investigate the effects of 
both time-stepping and PM-grid transition on the resnlting power spectrnm. 

In order to test the effect of time-stepping, we rnn simnlations with alternative time¬ 
stepping criteria for all three codes of the comparison project. For Ramses and Gadgets 
we nse the global time-stepping mode as an alternative, where all particles trajectories are 
integrated with the smallest time-stepping of the adaptive (defanlt) mode independently of 
their gravitational acceleration. For Pkdgrav we keep the adaptive natnre of time-stepping 
and vary the time-stepping parameter rj aronnd the defanlt choice t] = 0.15. 

The impact of the time-stepping on the power spectrnm is illnstrated in the left panel 
Fig. 6. Switching to global time-stepping affects the resnlt at the percent level aronnd A: ~ 10 
/iMpc“^ for both Ramses and Gadgets, however, with an inverse general trend (redncing 
power for Ramses and increasing it for Gadgets). Varying the r] parameter in PkdgravS 
aronnd rj = 0.1 — 0.2 also leads to a percent effect on the power spectrnm at A: ~ 10 /iMpc“^ 
(bottom panel), with the general trend of increasing power for smaller time-steps. Based 
on these tests we conclnde that the resnlts from the code comparison (i.e., Fig. 1) are not 
sensitive to the time-step criterion as long as reasonable parameter choices are considered. 

The effect of the PM-grid transition in Gadget is investigated by rnnning simnlations 
with different size of the PM grid aronnd the defanlt choice (where the nnmber of grid points 
eqnals the particle nnmber, i.e., PM-grid = N). The resnlting power spectra are illnstrated in 
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Figure 6. Left: Different time-stepping strategies and their effects on the auto power spectrum. For 
Rcunses (top) and Gadgets (centre) we show adaptive (default) and global time-stepping, for PkdgravS 
(bottom) we vary the time-stepping parameter r] around the default value 77 = 0.15. Right: Varying 
grid size of the PM-mesh in Gadgets and how this affects the auto power spectrum. 


right panel of Fig. 6 , showing differences at the percent level over varions scales. The size of the 
scatter seems signihcant for precision cosmology and reqnires fnrther investigation. However, 
the variation is not large enongh to explain the offset between Gadget and Ramses/Pkdgrav 
in the z = 0 panel of Fig. 1. 

We have shown in this appendix that changing the time-stepping criterion of our codes 
has a sub-percent effect on the auto power spectrum below A: ~ 10 h Mpc“^. The error induced 
by the PM-tree-transition in Gadget is slightly larger but still roughly below one percent. As 
argued above, other parameters, such as softening and tree opening for tree-codes as well as 
the accuracy of the Poisson solver for mesh-codes are expected to yield even smaller errors. 
We therefore conclude that simple tuning of parameters is not enough to bring the different 
codes into sub-percent agreement. Deeper investigations of the discretisation and integration 
techniques might be required to achieve this goal. 

B Beyond the power spectrum: propagator and bispectrum 

In this appendix we consider a different set of statistics from the main text to have a deeper 
understanding of the differences between the A^-body codes and also illustrate the robustness 
of the results obtained so far. The hrst is the propagator G{k) which results from the cross¬ 
correlation of the initial conditions ho (common to all codes) with the density fluctuations 5 
at a given redshift, 

(h(k,z)ho(kO) 

(ho(k)ho(kO) • 

This two-point statistic is sensitive to the displacement of particles away from their initial 
conditions on scales signihcantly larger than 2TTjk, unlike the equal-time power spectrum 
considered in the main text. At leading order in perturbation theory valid at large scales 
G{k) agrees with the growth factor, whereas at small scales the propagator drops to zero on 
scales smaller than the inverse of the rms displacement held at the given redshift [53]. 
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Figure 7. Left: The solid lines show the ratio of Gadgets and PkdgravS propagators to the Raunses 
propagator at z = 2. The dashed lines denote the expectation of these ratios based on RegPT. Right: 
Same as left panel, but at z = 0. 


Since we have only one realisation for each A^-body code, it is difficult to conclude 
anything from comparing the propagators to their expected large-scale limit, e.g. computed 
using RegPT [54], For example, looking at the first five bins in Fourier space, all three codes 
agree with the predictions to better than 0.5% at z = 2 but there is no clear winner in terms 
of best agreement as the measurements fluctuate about the theoretical result as k changes. 
For this reason it is more robust to look at the ratio between different codes and explore to 
what extent the differences between them can be understood. 

In Fig. 7 we show the ratio of the measured propagators to that of the Ramses code, at 
z = 2 (left panel) and z = 0 (right panel), shown from the fundamental mode of the simulation 
box up to scales where the propagator drops (exponentially) to zero. There are three points 
worth making here. First, the low-fc deviations in the propagator ratios are largely consistent 
(half the value) with those seen in the power spectrum in Fig. 1. The second point is that 
a low-fc enhancement (suppression) goes together with a high-fe suppression (enhancement) 
of the propagators. This makes sense as a low-fc enhancement corresponds to an overall 
larger displacement field, which also decreases the propagator at small scales as the cross¬ 
correlation between initial and final conditions is suppressed by what is, effectively, slightly 
more time evolution. Finally, the redshift dependence shows that while the relative behaviour 
of PkdgravS and Ramses is the same at z = 0 and z = 2, the Gadgets deviations compared 
to Ramses have opposite signs at the outputs considered here. 

The figure also shows the expected propagator ratios from RegPT (dashed lines). At 
z = 2 the expectation works very well for the ratio of Gadgets to Ramses propagators, but 
not so well for the PkdgravS to Ramses, which also shows significant more noise, particularly 
at the fundamental mode of the box. Clearly, the early (up to z = 2) time evolution of 
Gadgets and Ramses are consistent with each other except for some small (less than 0.1%) 
relative displacement, while the evolution of PkdgravS is not as consistent with Ramses (or 
Gadgets) in terms of an overall slight displacement mismatch (as shown by comparison with 
RegPT and the relative fluctuations. This is perhaps not surprising, as the large-scale forces 
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Figure 8. Left: The ratio of Gadgets and PkdgravS reduced bispectra to the Ramses reduced 
bispectrum at z = 2. The reduced bispectrum in each case corresponds to an average over all 
triangles with maximum wavenumber equal to k. Right: Same as left panel, but at z = 0. 


are computed by the PM method in both Ramses and Gadgets, while a tree is used in the 
PkdgravS case. As the evolution proceeds to z = 0, however, the relative evolution of Gadgets 
to Ramses drifts and changes sign (at z = 0 Gadgets is slightly less evolved than Ramses) 
with this relative behavior still fairly well predicted by RegPT. For PkdgravS the low-Zc noise 
remains, but the relative evolution to Ramses appears much more consistent than it was at 
z = 2 to what is expected from RegPT. 

We now consider the bispectrum. Given the discussion above, we expect that the dif¬ 
ference in the bispectra obtained from the different simulations to differ mostly by an overall 
constant growth factor, and since to leading order in perturbation theory the bispectrum 
scales as the power spectrum squared, the difference in amplitude between bispectra should 
be about twice that observed in the power spectrum in Fig. 1 (or four times that in the prop¬ 
agator in Fig. 7). For this reason, it is convenient to show results for the reduced bispectrum 
Qi 23 defined as, 


Qi23 = 


_ Bi23 _ 

Pi P2 + P2P3 + P3 Pi ’ 


(B.2) 


where Pi = P{ki) and ((5(ki) <5(k2) (^(ks)) = <5£)(ki -|-k 2 Tks) B 123 with B 123 the bispectrum. 
The reduced bispectrum is, to leading order in perturbation theory, independent of the overall 
value of the growth factor. 

Figure 8 shows the results for the reduced bispectra at z = 0,2. We have measured 
the bispectrum for all triangle shapes from scales of twice the fundamental mode of the box 
kf ~ 0.0126 /iMpc“^ up to 160fc/ ~ 2.01 /iMpc“^, using the method discussed in [55]. For 
simplicity we show the results for the reduced bispectra after averaging all triangles whose 
maximum side is k (the horizontal axis in Fig. 8). We see from this figure that the reduced 
bispectra are overall in remarkable agreement at the sub-percent level, while the agreement 
between bispectra (not shown) is at the one-percent level at large-scales (as expected from the 
power spectrum results and the scaling discussed above). In particular, we see that at z = 2 
the agreement between Gadgets and Ramses is essentially perfect until k ~ 0.6 /iMpc“^ and 
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then it shoots up by only 0.4% by /c ~ 2 /iMpc“^. On the other hand, for PkdgravS the 
differences are more noticeable for the low-A; modes (as noted for the propagator and power 
spectrum before). At z = 0, for most triangles at low-A: the differences remain at the 0.2% level 
at most, while at high-A: the remain below 0.5%. Overall these results are consistent with the 
picture discussed above, that the evolution of Gadgets and Ramses are fairly consistent with 
each other up to an overall small mistmatch in growth factors, while the early time evolution 
of PkdgravS differs by a bit more than just an overall scale factor at large scales {k < 0.1 
h Mpc“^). In fact, for the sake of clarity Fig. 8 starts from k = 4A;j, but the ratio of PkdgravS 
to Ramses reduced bispectra for equilateral triangle of sides k = 2kf is as large as ~ 1.05 
at z = 0, 2. Unfortunately there is no reliable way of telling which results, PkdgravS on the 
one hand, or Ramses and Gadgets (which are consistent among themselves), are the correct 
ones as the cosmic variance is significant at these scales for a single realization of a relatively 
small simulation box. At small scales all the codes differ in their reduced bispectra at below 
the one-percent level and the situation is even less clear cut. It seems, however, entirely 
possible that making the large-scale factors agree better than we have here can move towards 
making those discrepancies even smaller, as the reduced bispectrum is affected by overall 
growth in the nonlinear regime. To make this more quantitative, since the rms large-scale 
displacement at z = 0 is of order 10 /iMpc“^, the error seen on it of order 0.1% (see Fig. 7) 
corresponds to 0.01 /iMpc“^ errors on displacements, which of course can give larger than 
percent corrections to the power spectrum for A: > 1 /iMpc“^. In addition, the large-scale 
enhancement of displacements (or growth factors) which correspond to slightly more evolved 
configurations do have an enhancement of the small-scale power spectrum, as expected. 
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